A hierarchical Dirichlet process mixture model for haplotype reconstruction from multi-population data
The perennial problem of "how many clusters?" remains an issue of substantial
interest in data mining and machine learning communities, and becomes
particularly salient in large data sets such as populational genomic data where
the number of clusters needs to be relatively large and open-ended. This
problem gets further complicated in a co-clustering scenario in which one needs
to solve multiple clustering problems simultaneously because of the presence of
common centroids (e.g., ancestors) shared by clusters (e.g., possible descendants
from a certain ancestor) from different multiple-cluster samples (e.g.,
different human subpopulations). In this paper we present a hierarchical
nonparametric Bayesian model to address this problem in the context of
multi-population haplotype inference. Uncovering the haplotypes of single
nucleotide polymorphisms is essential for many biological and medical
applications. While it is not uncommon for genotype data to be pooled from
multiple ethnically distinct populations, few existing programs have explicitly
leveraged the individual ethnic information for haplotype inference. In this
paper we present a new haplotype inference program, Haploi, which makes use of
such information and is readily applicable to genotype sequences with thousands
of SNPs from heterogeneous populations, with competitive and sometimes superior
speed and accuracy compared to state-of-the-art programs. Underlying
Haploi is a new haplotype distribution model based on a nonparametric Bayesian
formalism known as the hierarchical Dirichlet process, which represents a
tractable surrogate to the coalescent process. The proposed model is
exchangeable, unbounded, and capable of coupling demographic information of
different populations.
Comment: Published at http://dx.doi.org/10.1214/08-AOAS225 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
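The "unbounded" clustering with ancestors shared across populations that this abstract describes can be illustrated with a draw from the Chinese restaurant franchise, a standard representation of the hierarchical Dirichlet process prior. The sketch below is not Haploi and performs no haplotype inference; the function name `sample_franchise` and the concentration parameters `alpha` and `gamma` are assumptions chosen for illustration.

```python
import random


def sample_franchise(group_sizes, alpha=1.0, gamma=1.0, seed=0):
    """Draw shared-cluster labels for several groups from a
    Chinese restaurant franchise prior (illustrative only)."""
    rng = random.Random(seed)
    dish_counts = []  # tables (over all groups) serving each shared ancestor
    labels = []
    for n in group_sizes:
        table_sizes = []  # customers seated at each table in this group
        table_dish = []   # shared ancestor index served at each table
        group_labels = []
        for _ in range(n):
            # Join an existing table with weight equal to its size,
            # or open a new table with weight alpha.
            weights = table_sizes + [alpha]
            t = rng.choices(range(len(weights)), weights=weights)[0]
            if t == len(table_sizes):
                # A new table picks an existing ancestor with weight equal
                # to the number of tables serving it, or a brand-new
                # ancestor with weight gamma.
                dish_weights = dish_counts + [gamma]
                d = rng.choices(range(len(dish_weights)), weights=dish_weights)[0]
                if d == len(dish_counts):
                    dish_counts.append(0)
                dish_counts[d] += 1
                table_sizes.append(0)
                table_dish.append(d)
            table_sizes[t] += 1
            group_labels.append(table_dish[t])
        labels.append(group_labels)
    return labels, len(dish_counts)


if __name__ == "__main__":
    labels, n_ancestors = sample_franchise([50, 50])
    print(n_ancestors, [sorted(set(g)) for g in labels])
```

The number of ancestors is not fixed in advance: it grows with the data, and the same ancestor index can appear in more than one population, which is the coupling the abstract refers to.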
MildInt: Deep Learning-Based Multimodal Longitudinal Data Integration Framework
As large amounts of heterogeneous biomedical data become available, numerous methods for integrating such datasets have been developed to extract complementary knowledge from multiple source domains. Recently, deep learning approaches have shown promising results in a variety of research areas. However, applying deep learning requires expertise in constructing a deep architecture that can take multimodal longitudinal data. Thus, in this paper, a deep learning-based Python package for data integration is developed. The Python package, the deep learning-based multimodal longitudinal data integration framework (MildInt), provides a preconstructed deep learning architecture for a classification task. MildInt contains two learning phases: learning a feature representation from each modality of data and training a classifier for the final decision. Adopting a deep architecture in the first phase leads to learning a more task-relevant feature representation than a linear model would. In the second phase, a linear regression classifier is used for detecting and investigating biomarkers from multimodal data. Thus, by combining the linear model and the deep learning model, higher accuracy and better interpretability can be achieved. We validated the performance of our package using simulation data and real data. For the real data, as a pilot study, we used clinical and multimodal neuroimaging datasets in Alzheimer's disease to predict disease progression. MildInt is capable of integrating multiple forms of numerical data, including time series and non-time series data, for extracting complementary features from the multimodal dataset.
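The two-phase layout described in this abstract can be sketched in a few lines. This is not the MildInt API: `encode_series`, `integrate`, and `classify` are hypothetical names, the fixed recurrence is only a stand-in for a learned deep encoder, and the weights are made up rather than trained.

```python
import math


def encode_series(series, decay=0.5):
    """Phase 1 stand-in: fold a variable-length time series into a single
    state with a fixed recurrence (a real encoder would be learned)."""
    state = 0.0
    for x in series:
        state = math.tanh(decay * state + (1.0 - decay) * x)
    return state


def integrate(sample):
    """Build one feature per modality, in sorted modality order.
    Lists are treated as time series; scalars pass through unchanged."""
    features = []
    for name in sorted(sample):
        value = sample[name]
        if isinstance(value, list):
            features.append(encode_series(value))
        else:
            features.append(float(value))
    return features


def classify(features, weights, bias=0.0):
    """Phase 2: linear score on the concatenated features, thresholded
    at zero. Each weight can be read as one modality's contribution."""
    score = bias + sum(w * f for w, f in zip(weights, features))
    return 1 if score > 0 else 0


if __name__ == "__main__":
    # Hypothetical subject: two longitudinal modalities of different
    # lengths plus one non-time-series value.
    subject = {"cognitive": [0.2, 0.4, 0.9], "imaging": [1.0, 0.8], "age": 0.7}
    feats = integrate(subject)
    print(feats, classify(feats, weights=[0.5, 1.0, 1.0], bias=-1.0))
```

Keeping the second phase linear is what allows a per-modality reading of the weights, which is the interpretability argument the abstract makes.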
A multivariate regression approach to association analysis of a quantitative trait network
Motivation: Many complex disease syndromes such as asthma consist of a large number of highly related, rather than independent, clinical phenotypes, raising a new technical challenge in identifying genetic variations associated simultaneously with correlated traits. Although a causal genetic variation may influence a group of highly correlated traits jointly, most previous association analyses considered each phenotype separately, or combined results from a set of single-phenotype analyses.
Relative impact of multi-layered genomic data on gene expression phenotypes in serous ovarian tumors
- …